Multi-stream CNN based Video Semantic Segmentation for Automated Driving
The majority of semantic segmentation algorithms operate on a single frame even
in the case of videos. In this work, the goal is to exploit temporal
information within the model to leverage motion cues and temporal
consistency. We propose two simple high-level architectures based on Recurrent
FCN (RFCN) and Multi-Stream FCN (MSFCN) networks. In the case of RFCN, a recurrent
network, namely an LSTM, is inserted between the encoder and decoder. MSFCN combines
the encoders of different frames into a fused encoder via 1x1 channel-wise
convolution. We use a ResNet50 network as the baseline encoder and construct
three networks, namely MSFCN of order 2 & 3 and RFCN of order 2. MSFCN-3
produces the best results, with accuracy improvements of 9% and 15% for the
Highway and New York-like city scenarios of the SYNTHIA-CVPR'16 dataset using
the mean IoU metric. MSFCN-3 also yields improvements of 11% and 6% on the
SegTrack V2 and DAVIS datasets over the baseline FCN network. We also designed
an efficient version of MSFCN-2 and RFCN-2 using weight sharing between the two
encoders. The efficient MSFCN-2 provided improvements of 11% and 5% for KITTI
and SYNTHIA with a negligible increase in computational complexity compared to
the baseline version.
Comment: Accepted for Oral Presentation at VISAPP 201
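To make the fusion step concrete, below is a minimal PyTorch sketch of multi-stream fusion in which per-frame encoder features are concatenated and combined by a 1x1 channel-wise convolution, with the encoder weights shared across streams as in the efficient variant; the decoder head, number of classes, and layer choices are illustrative assumptions, not the authors' exact configuration.

import torch
import torch.nn as nn
import torchvision

class MSFCNFusion(nn.Module):
    """Fuses encoder features from N consecutive frames with a 1x1 convolution."""

    def __init__(self, num_streams=3, num_classes=13):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        # Shared ResNet50 encoder applied to every frame (weight sharing).
        self.encoder = nn.Sequential(*list(resnet.children())[:-2])  # (B, 2048, H/32, W/32)
        # 1x1 channel-wise convolution fusing the concatenated per-frame features.
        self.fuse = nn.Conv2d(2048 * num_streams, 2048, kernel_size=1)
        # Placeholder decoder standing in for the FCN decoder.
        self.decoder = nn.Sequential(
            nn.Conv2d(2048, num_classes, kernel_size=1),
            nn.Upsample(scale_factor=32, mode="bilinear", align_corners=False),
        )

    def forward(self, frames):
        # frames: list of num_streams tensors, each of shape (B, 3, H, W)
        feats = [self.encoder(f) for f in frames]
        fused = self.fuse(torch.cat(feats, dim=1))
        return self.decoder(fused)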
Self-Supervised Online Camera Calibration for Automated Driving and Parking Applications
Camera-based perception systems play a central role in modern autonomous
vehicles. These camera-based perception algorithms require an accurate
calibration to map real-world distances to image pixels. In practice,
calibration is a laborious procedure requiring specialised data collection and
careful tuning. This process must be repeated whenever the parameters of the
camera change, which can be a frequent occurrence in autonomous vehicles. Hence,
there is a need to calibrate at regular intervals to ensure the camera remains
accurate. We propose a deep learning framework to learn the intrinsic and
extrinsic calibration of the camera in real time. The framework is
self-supervised and does not require any labelling or supervision to learn the
calibration parameters. The framework learns calibration without the need for
any physical targets or driving the car on special planar surfaces.
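For context on what mapping real-world distances to image pixels involves, here is a minimal NumPy sketch of the standard pinhole projection driven by intrinsic and extrinsic parameters, i.e. the quantities the proposed framework estimates online; the numeric values are illustrative and not taken from the paper.

import numpy as np

def project_point(X_world, K, R, t):
    """Project a 3D world point to pixel coordinates with a pinhole camera model.

    K: 3x3 intrinsic matrix (focal lengths and principal point)
    R, t: extrinsic rotation (3x3) and translation (3,) from world to camera frame
    """
    X_cam = R @ X_world + t   # world -> camera coordinates
    x = K @ X_cam             # camera -> homogeneous image coordinates
    return x[:2] / x[2]       # perspective divide -> (u, v) in pixels

# Illustrative calibration values only.
K = np.array([[800.0, 0.0, 640.0],
              [0.0, 800.0, 360.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.zeros(3)
print(project_point(np.array([1.0, 0.5, 10.0]), K, R, t))  # point 10 m ahead of the camera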
Fast and Efficient Scene Categorization for Autonomous Driving using VAEs
Scene categorization is a useful precursor task that provides prior knowledge
for many advanced computer vision tasks with a broad range of applications in
content-based image indexing and retrieval systems. Despite the success of data-driven
approaches in computer vision tasks such as object detection and
semantic segmentation, their application in learning high-level features
for scene recognition has not achieved the same level of success. We propose to
generate a fast and efficient intermediate interpretable generalized global
descriptor that captures coarse features from the image and use a
classification head to map the descriptors to 3 scene categories: Rural, Urban
and Suburban. We train a Variational Autoencoder in an unsupervised manner to
map images to a constrained multi-dimensional latent space, and we use the latent
vectors as compact embeddings that serve as global descriptors for images. The
experimental results demonstrate that the VAE latent vectors capture coarse
information from the image, supporting their usage as global descriptors. The
proposed global descriptor is very compact with an embedding length of 128, is
significantly faster to compute, and is robust to seasonal and illumination
changes, while capturing sufficient scene information required for scene
categorization.
Comment: Published in the 24th Irish Machine Vision and Image Processing Conference (IMVIP 2022)
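A minimal PyTorch sketch of the described pipeline: a convolutional VAE encoder maps an image to a 128-dimensional latent vector that acts as the global descriptor, and a small classification head maps that embedding to the three scene categories; the encoder layers and head are illustrative assumptions rather than the paper's exact architecture.

import torch
import torch.nn as nn

class SceneVAEClassifier(nn.Module):
    def __init__(self, latent_dim=128, num_classes=3):
        super().__init__()
        # Toy convolutional encoder; the actual model may differ.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.fc_mu = nn.Linear(64, latent_dim)      # mean of the latent Gaussian
        self.fc_logvar = nn.Linear(64, latent_dim)  # log-variance of the latent Gaussian
        # Classification head operating on the compact global descriptor.
        self.head = nn.Linear(latent_dim, num_classes)  # Rural / Urban / Suburban

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterisation trick
        return self.head(z), mu, logvar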
Towards a performance analysis on pre-trained Visual Question Answering models for autonomous driving
This short paper presents a preliminary analysis of three popular Visual
Question Answering (VQA) models, namely ViLBERT, ViLT, and LXMERT, in the
context of answering questions relating to driving scenarios. The performance
of these models is evaluated by comparing the similarity of responses to
reference answers provided by computer vision experts. Model selection is
predicated on the analysis of transformer utilization in multimodal
architectures. The results indicate that models incorporating cross-modal
attention and late fusion techniques exhibit promising potential for generating
improved answers in a driving context. This initial analysis serves as
a launchpad for a forthcoming comprehensive comparative study involving nine
VQA models and sets the scene for further investigations into the effectiveness
of VQA model queries in self-driving scenarios. Supplementary material is
available at
https://github.com/KaavyaRekanar/Towards-a-performance-analysis-on-pre-trained-VQA-models-for-autonomous-driving
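As an illustration of comparing model responses to reference answers, the sketch below scores a response with a simple token-overlap (Jaccard) similarity; the paper's actual similarity measure may differ, and the example strings are hypothetical.

def token_overlap_similarity(answer, reference):
    # Jaccard similarity between the token sets of a model answer and a reference answer.
    a, r = set(answer.lower().split()), set(reference.lower().split())
    return len(a & r) / len(a | r) if a | r else 0.0

# Hypothetical driving-scene example: a model answer versus an expert reference.
model_answer = "the traffic light is red"
reference_answer = "red traffic light ahead"
print(token_overlap_similarity(model_answer, reference_answer))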
Near Field iToF LIDAR Depth Improvement from Limited Number of Shots
Indirect Time of Flight (iToF) LiDARs indirectly calculate the scene's depth
from the phase shift angle between the transmitted and received laser signals,
whose amplitudes are modulated at a predefined frequency. Unfortunately, this method
generates ambiguity in the calculated depth when the phase shift angle
exceeds 2π. Current state-of-the-art methods use raw samples generated
using two distinct modulation frequencies to overcome this ambiguity problem.
However, this comes at the cost of increasing laser components' stress and
raising their temperature, which reduces their lifetime and increases power
consumption. In our work, we study two different methods to recover the entire
depth range of the LiDAR using fewer raw data sample shots from a single
modulation frequency, with the support of the sensor's grayscale output, to
reduce the laser components' stress and power consumption.
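For background on where the ambiguity comes from, the sketch below applies the textbook four-sample, single-frequency iToF relation: the phase shift is recovered with an arctangent and converted to depth, which wraps every c / (2f); this is the standard formulation, not the paper's proposed recovery method, and the sign convention for the four taps varies between sensors.

import numpy as np

C = 299_792_458.0  # speed of light in m/s

def itof_depth(a0, a1, a2, a3, mod_freq_hz):
    """Depth from four raw correlation samples taken at 0, 90, 180, 270 degrees.

    The phase wraps at 2*pi, so any true depth beyond c / (2 * f) is ambiguous,
    which is the limitation the paper addresses with fewer shots at one frequency.
    """
    phase = np.arctan2(a3 - a1, a0 - a2)   # phase shift in [-pi, pi]
    phase = np.mod(phase, 2 * np.pi)       # wrap into [0, 2*pi)
    return C * phase / (4 * np.pi * mod_freq_hz)

# Unambiguous range at 20 MHz modulation is c / (2 * f) = 7.5 m.
print(itof_depth(0.2, 0.8, 0.9, 0.3, 20e6))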
Revisiting Modality Imbalance In Multimodal Pedestrian Detection
Multimodal learning, particularly for pedestrian detection, has recently
received emphasis due to its capability to function equally well in several
critical autonomous driving scenarios such as low-light, night-time, and
adverse weather conditions. However, in most cases, the training distribution
largely emphasizes the contribution of one specific input, which makes the
network biased towards one modality. Hence, the generalization of such models
becomes a significant problem, since the modality that is non-dominant during
training could contribute more during inference. Here, we
introduce a novel training setup with a regularizer in the multimodal
architecture to resolve this disparity between the modalities.
Specifically, our regularizer term helps make the feature fusion method more
robust by treating both feature extractors as equally important during
training when learning the multimodal distribution, which we refer to as
removing the imbalance problem. Furthermore, our concept of decoupling the output
stream helps the detection task by mutually sharing the spatially sensitive
information. Extensive experiments of the proposed method on the KAIST and UTokyo
datasets show improvements over the respective state-of-the-art performance.
Comment: 5 pages, 3 figures, 4 tables
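As a rough illustration of a balancing regularizer, the sketch below penalizes the gap between the average feature magnitudes of the two modality branches so that neither extractor dominates training; this formulation and the loss weight are illustrative assumptions, not necessarily the paper's exact regularizer term.

import torch

def modality_balance_regularizer(rgb_features, thermal_features):
    """Penalize imbalance between the two modality branches before fusion.

    rgb_features, thermal_features: feature maps of shape (B, C, H, W) from the two extractors.
    Returns a scalar to be added to the detection loss with some weight.
    """
    rgb_strength = rgb_features.norm(p=2, dim=1).mean()
    thermal_strength = thermal_features.norm(p=2, dim=1).mean()
    return (rgb_strength - thermal_strength).abs()

# Usage inside a training step (the 0.1 weight is illustrative):
# loss = detection_loss + 0.1 * modality_balance_regularizer(f_rgb, f_thermal)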